SITE LINK

KMID : 1022420200120030055

Phonetics and Speech Sciences
2020 Volume.12 No. 3 p.55 ~ p.63

Voice-to-voice conversion using transformer network

Kim June-Woo

Jung Ho-Young

Abstract

Voice conversion can be applied to various voice processing applications. It can also play an important role in data augmentation for speech recognition. The conventional method uses the architecture of voice conversion with speech synthesis, with Mel filter bank as the main parameter. Mel filter bank is well-suited for quick computation of neural networks but cannot be converted into a high-quality waveform without the aid of a vocoder. Further, it is not effective in terms of obtaining data for speech recognition. In this paper, we focus on performing voice-to-voice conversion using only the raw spectrum. We propose a deep learning model based on the transformer network, which quickly learns the voice conversion properties using an attention mechanism between source and target spectral components. The experiments were performed on TIDIGITS data, a series of numbers spoken by an English speaker. The conversion voices were evaluated for naturalness and similarity using mean opinion score (MOS) obtained from 30 participants. Our final results yielded 3.52±0.22 for naturalness and 3.89±0.19 for similarity.

KEYWORD

voice conversion, transformer network, signal-to-signal conversion

FullTexts / Linksout information

Listed journal information

site infomation

Prohibition of Unauthorized Collection of E-mail Addresses, medric.kyung@gmail.com
N4 301, Chungbuk National University, Chungdae-ro 1, Seowon-Gu, Cheongju, Chungbuk 28644, Korea